Objective
To showcase the minimum number of steps
required to do tertiary analysis of DNA + Protein
and some of the different ways to look at the data
Major questions answered:
Things not shown:
All available methods, e.g. filtering of nearby variants, variant annotation, plots
Discussing all methods and their options - Documented here
Systemic variations seen in protein data - Next session
%load_ext autoreload
import missionbio.mosaic.io as mio
h5path = '/path/to/sample.h5'
sample = mio.load(h5path, raw=False)
H5 files replace loom files.
Where to get them?
These are part of the DNA and protein pipeline output
The sample h5 used in this workflow can be found here
Note: This is an h5 file trimmed specifically for this analysis
Dna, Cnv, and Protein are subclasses of the Assay class
The information is stored in four ways, and the user
can change each of those
1. metadata (add_metadata / del_metadata):
dictionary containing metrics of the assay
2. row_attrs (add_row_attr / del_row_attr):
dictionary which contains 'barcode' as one of
the keys. All the values must be of the same
length i.e. match the number of barcodes
This is the attribute where 'label', 'pca',
and 'umap' values are added
3. col_attrs (add_col_attr / del_col_attr):
dictionary which contains 'ids' as one of
the keys. All the values must be of the same
length i.e. match the number of ids
'ids' contains variants for DNA assays
and antibodies for Protein assays
4. layers (add_layer / del_layer):
dictionary containing 'read_counts' as one of
the metrics. All the values have the shape
(num barcodes) x (num ids). This is the attribute
where 'normalized_counts' will be added
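The four containers and their length and shape rules can be sketched with a toy class. This is only an illustration of the constraints described above, not the Mosaic implementation: `MiniAssay`, its constructor arguments, and the barcodes and antibody names are hypothetical, and plain Python lists stand in for arrays.

```python
class MiniAssay:
    """Toy stand-in for an assay (hypothetical, not the Mosaic Assay class)."""

    def __init__(self, barcodes, ids, read_counts):
        self.metadata = {}                            # metrics of the assay
        self.row_attrs = {"barcode": list(barcodes)}  # one value per barcode
        self.col_attrs = {"ids": list(ids)}           # one value per id
        self.layers = {}                              # (num barcodes) x (num ids)
        self.add_layer("read_counts", read_counts)

    @property
    def shape(self):
        return (len(self.row_attrs["barcode"]), len(self.col_attrs["ids"]))

    def add_row_attr(self, name, values):
        # Row attributes must match the number of barcodes
        if len(values) != self.shape[0]:
            raise ValueError(f"expected {self.shape[0]} values, got {len(values)}")
        self.row_attrs[name] = list(values)

    def add_col_attr(self, name, values):
        # Column attributes must match the number of ids
        if len(values) != self.shape[1]:
            raise ValueError(f"expected {self.shape[1]} values, got {len(values)}")
        self.col_attrs[name] = list(values)

    def add_layer(self, name, matrix):
        # Every layer must have one row per barcode and one column per id
        rows, cols = self.shape
        if len(matrix) != rows or any(len(row) != cols for row in matrix):
            raise ValueError(f"expected a {rows} x {cols} matrix")
        self.layers[name] = [list(row) for row in matrix]


# Three made-up barcodes, two made-up antibody ids
assay = MiniAssay(
    barcodes=["AAAC", "CCGT", "GGTA"],
    ids=["CD3", "CD19"],
    read_counts=[[10, 0], [3, 7], [0, 12]],
)
assay.add_row_attr("label", ["1", "1", "2"])
print(assay.shape)
```

Rejecting mismatched lengths at insertion time is what keeps every attribute aligned with the barcodes and ids when the assay is later sliced.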
Sample holds the Dna and Protein information
sample.protein
sample.protein.metadata
sample.protein.row_attrs
sample.protein.ids()
sample.dna.layers
Topics covered
Many filtering options are available.
Use the documentation shared earlier,
or the help() function, to get the same
information here.
help(sample.dna.filter_variants)
# Filter variants
# This is the default insights filtering method
dna_vars = sample.dna.filter_variants()
dna_vars
# Check the number of filtered variants
len(dna_vars)
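To give a feel for what a variant filter of this kind does, here is a minimal sketch on toy data. It is an assumption about the general approach (ignore low-quality genotype calls, then keep variants that are genotyped and mutated in enough cells) and is not taken from the Mosaic source. The function name, parameter names, thresholds, and variant names are all hypothetical; use help(sample.dna.filter_variants) for the real options.

```python
def filter_variants_sketch(af, gq, ids, min_gq=30,
                           min_pct_genotyped=50, min_pct_mutated=1):
    """Return the ids of variants passing two toy criteria.

    af, gq: lists of rows (cells) x columns (variants).
    All thresholds are illustrative, not the library defaults.
    """
    n_cells = len(af)
    kept = []
    for j, variant in enumerate(ids):
        # Cells whose genotype call for this variant is of sufficient quality
        genotyped = [i for i in range(n_cells) if gq[i][j] >= min_gq]
        # Among those, cells carrying the variant (non-zero allele frequency)
        mutated = [i for i in genotyped if af[i][j] > 0]
        pct_genotyped = 100 * len(genotyped) / n_cells
        pct_mutated = 100 * len(mutated) / n_cells
        if pct_genotyped >= min_pct_genotyped and pct_mutated >= min_pct_mutated:
            kept.append(variant)
    return kept


# 4 cells x 3 made-up variants
toy_ids = ["var_a", "var_b", "var_c"]
toy_af = [[50, 0, 100], [50, 0, 0], [0, 0, 100], [0, 0, 0]]
toy_gq = [[99, 99, 10], [99, 99, 10], [99, 99, 10], [99, 99, 99]]

passing = filter_variants_sketch(toy_af, toy_gq, toy_ids)
print(passing)
```

In this toy data, var_b is dropped because no cell carries it, and var_c is dropped because only one of the four cells has a confident genotype call.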
Simply appending the whitelist to the list of filtered
variants is sufficient to then select the variants
using the slice notation
i.e. sample.dna[{list of barcodes}, {list of ids}]
whitelist = ['chr1:115256513:G/A', 'chr21:44514718:C/T']
final_vars = whitelist + list(dna_vars)
len(final_vars)
sample.dna.shape
# Selecting all cells and final variants
sample.dna = sample.dna[sample.dna.barcodes(), final_vars]
# Check the shape i.e. (Number of barcodes, number of ids)
# of the final filtered dna object
sample.dna.shape
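One detail worth checking when combining lists this way is that a whitelisted variant may also have passed the filter, in which case plain concatenation lists it twice. The sketch below shows an order-preserving merge followed by column selection by id on a toy matrix. The helper names `merge_ids` and `select_columns`, the toy allele frequencies, and the variant chrX:1:A/T are hypothetical; how Mosaic's slice notation treats repeated ids is not covered here.

```python
def merge_ids(whitelist, filtered):
    """Whitelist first, then filtered ids, dropping repeats but keeping order."""
    return list(dict.fromkeys(list(whitelist) + list(filtered)))


def select_columns(matrix, ids, keep):
    """Return the columns of `matrix` named in `keep`, in that order."""
    index = {name: i for i, name in enumerate(ids)}
    missing = [name for name in keep if name not in index]
    if missing:
        raise KeyError(f"ids not present: {missing}")
    cols = [index[name] for name in keep]
    return [[row[c] for c in cols] for row in matrix]


# Toy data: 2 cells x 3 variants (the middle variant is made up)
all_ids = ["chr1:115256513:G/A", "chrX:1:A/T", "chr21:44514718:C/T"]
toy_af = [[0, 50, 100], [50, 0, 0]]

toy_whitelist = ["chr1:115256513:G/A", "chr21:44514718:C/T"]
toy_filtered = ["chr21:44514718:C/T"]  # overlaps with the whitelist

toy_final = merge_ids(toy_whitelist, toy_filtered)
subset = select_columns(toy_af, all_ids, toy_final)
print(toy_final)
print(subset)
```

Comparing len(final_vars) with len(set(final_vars)) is a quick way to see whether the real list contains such repeats.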
Heatmaps are interactive. Clicking on a heatmap selects
the corresponding id, which is stored in the
`selected_ids` attribute of the object,
e.g. sample.dna.selected_ids
sample.dna.stripplot(layer='AF', color_layer='GQ')
sample.dna.heatmap(layer='AF')
sample.dna.selected_ids
sample.dna = sample.dna.drop(sample.dna.selected_ids)
DNA has a custom clustering method called `find_clones`.
It projects the data onto a UMAP and then performs
DBSCAN to identify unique clusters, which are then
merged in case they were formed due to missing
information
sample.dna.find_clones()
sample.dna.row_attrs
sample.dna.scatterplot(attribute='umap', colorby='label')
sample.dna.heatmap(layer='AF')
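The density-based step that `find_clones` is described as using can be illustrated on its own. The sketch below is a small textbook DBSCAN run on a made-up 2D embedding. It is not Mosaic's implementation: the UMAP projection and the subsequent merging of clusters are left out, and `eps`, `min_samples`, and the coordinates are hypothetical values chosen for the toy data.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster index; -1 marks noise."""
    n = len(points)
    # Neighbourhood of each point, including the point itself
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        # i is a core point: start a new cluster and grow it
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is also core: keep growing
    return labels


# Two tight groups of cells and one isolated cell in a toy 2D embedding
embedding = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
cluster_labels = dbscan(embedding, eps=1.5, min_samples=3)
print(cluster_labels)
```

Points labelled -1 belong to no cluster, which is one reason a method built on DBSCAN may need a follow-up step to reassign or merge cells.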
1. Basic filtering of barcodes and ids demonstrated
2. Basic DNA filtering functionality showcased
Preliminary heatmap of CNV shows that there could be two clusters
Topics covered
sample.cnv.heatmap(layer='normalized_counts')